Chapter 





3.1 INTRODUCTION 


As discussed in the previous chapter, Pandas 
is a well established Python Library used for 
manipulation, processing and analysis of 
data. We have already discussed the basic 
operations on Series and DataFrame like 
creating them and then accessing data from 
them. Pandas provides more powerful and 
useful functions for data analysis. 

In this chapter, we will be working with 
more advanced features of DataFrame like 
sorting data, answering analytical questions 
using the data, cleaning data and applying 
different useful functions on the data. Below 
is the example data on which we will be 
applying the advanced features of Pandas. 
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Case Study 

Let us consider the data of marks scored in unit tests 
held in school. For each unit test, the marks scored by 
all students of the class is recorded. Maximum marks 
are 25 in each subject. The subjects are Maths, Science. 
Social Studies (S.St.), Hindi, and English. For simplicity, 
we assume there are 4 students in the class and the 
table below shows their marks in Unit Test 1, Unit Test 
2 and Unit Test 3. Table 3.1 shows this data. 


Table 3.1 Case Study 








es Dm ee 
Subjects | Test 

Raman 1 22 ZA 18 20 21 
Raman 2 yt 20 1% 22 24 
Raman 3 14 19 15 24 23 
Zuhaire 1 20 17 22 24A 19 
Zuhaire P, ES 13 ZAD 25 LS 
Zuhaire 3 22 18 19 23 13 
Aashravy 1 23 19 20 15 22 
Aashravy 2 24 D2 24 17 21 
Aashravy 3} i ZS) 19 2i Zo 
Mishti 1 Lo Lp 25 22 22 
Mishti 2 18 21 25 24 23 
Mishti 3 lake 18 20 29 20 


Let us store the data in a DataFrame, as shown in 
Program 3.1: 


Program 3-1 Store the Result data in a DataFrame called marksUT. 


>>> import pandas as pa 


>>> marksUT= {'Name':['Raman', 'Raman', 'Raman', 'Zuhaire', 'Zuhaire','Zu 
haire', ‘Ashravyf “Mshravy', 'Ashravy', 'Mishti','Mishti', 'Mishti'], 


‘US'S 1, 2 3,1,2,3,1,2,3,1,2,3], 
WMS” = | 22421514, 20, 2572225724412; Lyle, tl, 
Cece” Fy 20, Ay gp hoy og 1 ye eye Oye pe hy |e 
Oo feo, ey Loy py pea Oy One a 
"Hindi (204 ZZ y 24924 pp 7 oy 2 5y 1 Op dy Zhy 2272542517 
"Eng = L2lyzo, cog ploy doy ec y elope cy coy 
i 

>>> df=pd.DataFrame (marksUT) 

>® print (df) 
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Name UT Maths Science S.St Hindi Eng 


O Raman 1 da 21 18 20 ZI 
L Raman 2 21 20 17 22 24 
Z Raman 3 14 19 Lo 24 Zo 
3 Zuhaire al 20 17 we 24 19 
4 Zuhaire 2 ZO 15 21 AD 15 
è Zuhaire 3 Zz Lo 193 Pe. 13 
6 Ashravy 1 7.2 19 20 LS A2 
7 Ashravy 2 24 Az 24 L4 21 
8 Ashravy 3 L2 25 19 21 Zo 
9 Mishti 1 15 22 Aa A2 22 
10 Mishti 2 18 21 2 24 23 
11 Mishti. 3 17 18 20 25 20 


3.2 DESCRIPTIVE STATISTICS 


Descriptive Statistics are used to summarise the given 
data. In other words, they refer to the methods which 
are used to get some basic idea about the data. 

In this section, we will be discussing descriptive 
statistical methods that can be applied to a DataFrame. 
These are max, min, count, sum, mean, median, mode, 
quartiles, variance. In each case, we will consider the 
above created DataFrame df. 


3.2.1 Calculating Maximum Values 


DataFrame.max() is used to calculate the maximum 
values from the DataFrame, regardless of its data types. 
The following statement outputs the maximum value of 
each column of the DataFrame: 

>>> print (df.max() ) 


Name Zuhaire #Maximum value in name column 

# (alphabetically) 
UT 3 #Maximum value in column UT 
Maths 24 #Maximum value in column Maths 
Science 295 #Maximum value in column Science 
Deol > #Maximum value in column S.St 
Hindi Zo #Maximum value in column Hindi 
Eng 24 #Maximum value in column Eng 


dtype: object 

If we want to output maximum value for the columns 
having only numeric values, then we can set the 
parameter numeric_only=True in the max() method, as 
shown below: 
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Peo Prinlidt.Mex(numerrvc only=lTrue) ) 


UT e 
Maths 24 
Science 25 
ore L 25 
Hindi as 
Eng 24 


dtype: int64 


Program 3-2 Write the statements to output the 
maximum marks obtained in each subject 
in Unit Test 2. 


>>> dfUT2 = df[df.UT == 2] 


>>> print('\nResult of Unit Test 2: 
win dIUT2) 


Result of Unit Test 2: 
Name UT Maths Science $.St Hindi Eng 





1 Raman 2 AI 20 17 a7 24 
4 Zuhaire 2 23 IES 21 ZO 15 
7 Ashravy 2 24 22 24 17 21 
10 Mishti 2 18 PA 25 24 23 

The output of Program 

3.2 can also be 292 
achieved using the >>> print('\nMaximum Mark obtained in 
following statements Each Subj ecin Test 2: \n\n' ra L JT: 


>>> d£UT2=d£ idi max (numer @gon]ly=iPrue) ) 


['UT']==2] .max 
(numeric only# Tue) 


Maximum Mark obtained in Each Subject in Unit 


>>> print (dfUT2) Test 2: 
UT 2 
Maths 24 
Science A2 
Deo 25 
Hindi Z5 
Eng 24 


dtype: int64 


By default, the max() method finds the maximum 
value of each column (which means, axis=0). However, 
to find the maximum value of each row, we have to 
specify axis = 1 as its argument. 


#maximum marks for each student in each unit 
test among all the subjects 
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>>> di max (axis=1) NOTES 
0 22 
1 24 
2 24 
3 24 
4 25 
5 23 
6 23 
7 24 
8 25 
9 25 
10 25 
11 25 


dtype: int64 


Note: In most of the python function calls, axis = O refers 
to row wise operations and axis = 1 refers to column wise 
operations. But in the call of max(), axis = 1 gives row wise 
output and axis = O (default case) gives column-wise output. 
Similar is the case with all statistical operations discussed 
in this chapter. 


3.2.2 Calculating Minimum Values 


DataFrame.min() is used to display the minimum values 
from the DataFrame, regardless of the data types. That 
is, it shows the minimum value of each column or row. 
The following line of code output the minimum value of 
each column of the DataFrame: 

>>> print (df Wels) ) 


Name Ashravy 
UT il 
Maths L2 
Science Ys 
omoa S 
Hindi 15 
Eng i95 


dtype: object 


Program 3-3 Write the statements to display the 
minimum marks obtained by a particular 
student ‘Mishti’ in all the unit tests for 
each subject. 


>>> dfMishti = df.loc[df.Name == 'Mishti'] 








The output of Program 
3.3 can also be 
achieved using the 
following statements 

>>> dfMishti=df[' 
Maths','Science','S. 


St','Hindi','Eng']||df. 
Name == 'Mishti'].min() 
>>> print(dfMishti) 


22 
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>>> print('\nMarks obtained by Mishti in all 
the Unit Tests \n\n',dfMishti) 


Marks obtained by Mishti in all the Unit Tests 
Name UT Maths Science S.St Hindi Eng 


9 Mishti Ji I 22 Zo ZZ 22 
10 Mishti 2 18 21 Zo 24 Zo 
Iil Mishtr 3 17 18 20 La 20 


>>> print('\nMinimum Marks obtained by 
Mishti in each subject across the unit 
tests\n\n', dfMishtil[['Maths','Science','S. 
Soy ‘Handi » “Eng. .min-()) 


Minimum Marks obtained by Mishti in each subject 
across the unit tests: 


Maths le 
Science 18 
oer 20 
Hindi 22 
Eng X) 


dtype: int64 
Note: Since we did not want to output the min value of 
column UT, we mentioned all the other column names for 
which minimum is to be calculated. 


3.2.3 Calculating Sum of Values 


DataFrame.sum() will display the sum of the values 
from the DataFrame regardless of its datatype. The 
following line of code outputs the sum of each column 
of the DataFrame: 

a> wrint(df.sum()) 


a E EE E E E T E 
UT 24 

Maths Aol 
Science eeN 
Doc 245 
Hindi 262 
Eng 246 


dtype: object 


We may not be interested to sum text values. So, 
to print the sum of a particular column, we need to 
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specify the column name in the call to function sum. 
The following statement prints the total marks of 
subject mathematics: 

>>> print(dft['Maths'].sum() ) 

ee aN 


To calculate total marks of a particular student, the 
name of the student needs to be specified. 


Program 3-4 Write the python statement to print 
the total marks secured by raman in 
each subject. 


>>> dfRaman=dft[dft['Name']=='Raman' | 


>>> print(“Marks obtained by Raman in eag test 
are:\n”, dfRaman) 


Marks obtained by Raman in each test are: 
Name UT Maths Science S.St. Hindi Eng 


QO Raman 1 22 21 18 20 21 
1 Raman f ZA 20 L3 22 24 
2 Raman 3 14 19 Po 24 23 


>>> dfRaman[['Maths', 'Science','S. 


St "Hindi; "Eng']1-sunh 
Maths Oe 
Science 60 
oor 50 
Hindi 66 
Eng 68 


dtype: int64 


#To print total marks scored by Raman in all 
subjects in each Unit Test 


>>> dfRaman[['Maths', 'Science','sS. 


St', 'Hindi', 'Engøatasum(axis=1) 
O 102 
1 104 
2 25 


dtype: int64 


3.2.4 Calculating Number of Values 


DataFrame.count() will display the total number of 
values for each column or row of a DataFrame. To count 
the rows we need to use the argument axis=1 as shown 
in the Program 3.5 below. 













Think and Reflect 





Can you write a 
shortened code to get 
the output of Program 
3.4? 


Activity 3.1 a 


Write the python 
statements to print 
the sum of the english 
marks scored by 
Mishti. 
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NOTES Poo Print (di .Count])] 
Name 12 
UT eZ 
Maths LZ 
Science 12 
Dent 12 
Hindi E 
Eng LZ 


dtype: int64 


Program 3-5 Write a statement to count the number of 
values in a row. 


>>> df.count (axil) 


o x om FF WN EF OO 


O 


10 
1i 
dtype: int64 


NJ N N N yNyNyyy WY yu N JN 


3.2.5 Calculating Mean 


DataFrame.mean() will display the mean (average) of 
the values of each column of a DataFrame. It is only 
applicable for numeric values. 

>>> dL:mean() 


UT 2 s000 
Maths 18.6000 
Science 19.8000 
Seok 20.0000 
Hindi 4l SAO 
Eng 19.8000 


dtype: floato4 


Program 3-6 Write the statements to get an average 
of marks obtained by Zuhaire in all the 
Unit Tests. 
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>>> dfZuhaireMarks = dfZuhaire. 
lọoc[:; Matchs" "Eng" | 


>>> print("Slicing of the DataFrame to get only 
the marks\n", dfZuhaireMarks) 





Slicing of the DataFrame to get only the marks 
Maths Science S.St Hindi Eng 


J 20 17 Ze 24 ig 
4 23 Lo 21 25 15 
5 22 LS 19 23 L> 


>>> print ("Average of marks obtained by 
Zuhaire in all Unit Tests \n", dfZuhaireMarks. 
mean (axis=1)) 


Average of marks obtained by Zuhaire in all 
Unit Tests 


3 20,4 
4 19.8 
5 19.0 Think and Reflect 








dtype: floato4 





Try to write a short 
code to get the above 
output. Remember 

to print the relevant 
headings of the output. 


In the above output, 20.4 is the average of marks 
obtained by Zuhaire in Unit Test 1. Similarly, 19.8 and 
19.0 are the average of marks in Unit Test 2 and 3 
respectively. 


3.2.6 Calculating Median 


DataFrame.Median() will display the middle value of the 
data. This function will display the median of the values 
of each column of a DataFrame. It is only applicable for 
numeric values. 

>>> print (df.median() ) 


UT Da 
Maths 19.0 
Science 20.0 
SPECIE 19% 
Hindi Ades 
Eng 21.0 


dtype: float64 


Program 3-7 Write the statements to print the median 
marks of mathematics in UT1. 


>>> dfMaths=df['Maths'] 





Activity 3.2 < 


Find the median of the 
values of the rows of 
the DataFrame. 


Activity 3.3 < 


Calculate the mode 
of marks scored in 
Maths. 
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>>> dfMathsUTl=dftMaths [df.UT==1] 


>>> print("Displaying the marks scored in 
Mathematics in UT1\n",dfMathsUT1) 


Displaying the marks of UT1, subject 
Mathematics 


O LA 
> 20 
6 vag: 
9 ibe 


Name: Maths, dtype: int64 


>>> dfMathMedian=dfMathsvUT1.median () 


>>> print("Displaying the median of Mathematics 
in UT1\n”,dfMathMedian) 


Displaying the median of Mathematics in UTI 

21.0 

Here, the number of values are even in number 
so two middle values are there i.e. 20 and 22. Hence, 
Median is the average of 20 and 22. 


3.2.7 Calculating Mode 


DateFrame.mode() will display the mode. The mode is 

defined as the value that appears the most number of 

times in a data. This function will display the mode of 

each column or row of the DataFrame. To get the mode 

of Hindi marks, the following statement can be used. 
>>> W| ®indi'] 

20 

ZZ 

24 

24 

Zo 

Zo 

Lo 

L7 

21 

ZZ 

10 24 

11 ZO 

Name: Hindi, dtype: int64 

>>> df['Hindi'].mode () 


Con DD OO A OO Les O 


LO 
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0 24 NOTES 
dtype: int64 


Note that three students have got 24 marks in Hindi 
subject while two students got 25 marks, one student 
got 23 marks, two students got 22 marks, one student 
each got 21, 20, 15, 17 marks. 


3.2.8 Calculating Quartile 


Dataframe.quantile() is used to get the quartiles. It 
will output the quartile of each column or row of the 
DataFrame in four parts i.e. the first quartile is 25% 
(parameter q = .25), the second quartile is 50% (Median), 
the third quartile is 75% (parameter q = .75). By default, 
it will display the second quantile (median) of all 
numeric values. 


>>> df.quantile() # by default, median is the 


OU Cpu 

UT 2.0 
Maths Aaa 
Science eS pes 
Gent Zi ei) 
Hindi Cs 
Eng aN) 


Name: 0.5, dtype: floato4 


>>> df.quantyle (q=.25) 


UT 1.00 
Maths No Se 
Science 13.00 
Deo Ow T5 
Hindi 2079 
Eng La ie 


Name: 0.25, dtype: floato4 


>>> df.quantile(q=.75) 


UT 32.00 
Maths P OR ae. 
Science Zee 
Sesh AA aU 
Hindi 24.00 
Eng 2300 


Name: 0.75, dtype: float64 





Activity 3.4 a 


Find the variance and 
standard deviation of 
the following scores on 
an exam: 92, 95, 85, 
80, 75, 50. 
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Program 3-8 Write the statement to display the first and 
third quartiles of all subjects. 


>>> dfSubject=dt[['Maths','Science','S. 
ot), Hindi; "Eng J | 

>>> print("Marks of all the subjects: \ 
n",dfSubject) 


Marks of all the subjects: 
Maths Science S.St Hindi Eng 


O 22 2l 18 20 21 
1 ZA 20 17 ae 24 
2 14 19 E 24 23 
2 20 L7 Ze 24 L9 
4 23 1 ZA Zo Re 
5 22 18 19 23 13 
6 23 19 v0 L3 22 
7 24 22 24 cle, £1 
8 i Zo rg al 2a 
> h AZ 25 Ae. ZZ 
10 18 21 S 24 Zo 
LL 17 18 2 O 25 20 


>>> dfO=dfSubject.quantile([.25,.75]) 


>>> print ("Farswand tmrd quartiles of all the 
subjects: \n", d£Q) 


First and third quartiles of all the subjects: 


Maths Science Seow Hindi Eng 
Oaa Oa () ISe00 1379 LUID 1979 
ON 22.25 Zlues ALDO 2400 23.00 


3.2.9 Calculating Variance 


DataFrame.var() is used to display the variance. It is the 
average of squared differences from the mean. 


>>> df[['Maths', 'Science','S. 
Duy HINA bndl] eau) 


Maths 15.840909 
Science 7.113636 
SDE 9.901515 
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Hindi 9.909697 
Eng L1,365636 
dtype: floato4 


3.2.10 Calculating Standard Deviation 


DataFrame.std() returns the standard deviation of the 
values. Standard deviation is calculated as the square 
root of the variance. 


>>> dft[['Maths','Science','S. 
poy Hondas “Eng” Iled) 


Maths 3.980064 
Science 2.667140 
Deir 3.146667 
Hindi 3.157483 
Eng Seo ee CRS, 


dtype: floato4 

DataFrame.describe() function displays the 
descriptive statistical values in a single command. These 
values help us describe a set of data in a DataFrame. 


>>> df.describe() 


UT Maths Science SNDL Hindi Eng 
count 12.000000 12.000000 12.00000 12.000000 12.000000 12.000000 
mean 2.000000 1% TMe 19.750007 2.067 21.833333 20.500000 
std 2.652503 3.980064 2.66714 3.146667 3.157483 54570909 
min 1.000000 12.000000 15.00000 15.000000 15.000000 13.000000 
25% 1.0@0@00 We.500000 18.0000% 18.750000 20.750000 19.750000 
50% 2.0Q0@007 0.500000 Mrs 50000 20.000000 22.500000 21.500000 
Tos 3.000098f 22.250000% 1.5000 22.500000 24.000000 23.000000 
max 3.000000 24.000000 25.00000 25.000000 25.000000 24.000000 


3.3 DATA AGGREGATIONS 


Aggregation means to transform the dataset and produce 
a single numeric value from an array. Aggregation can 
be applied to one or more columns together. Aggregate 
functions are max(),min(), sum(), count(), std(), var(). 


>>> df.aggregate('max') 


Name Zuhaire # displaying the maximum of Name 
as well 


UG 3 
Maths 24 
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NOTES Science 29 
Deol 25 
Hindi 49 
Eng 24 


dtype: object 


#TO use multiple aggregate functions ina 
Single statement 


>>> df.aggregate(['max', 'count']) 


Name UT Maths Science S.St Hindi Eng 
max dZuhaire 3 24 25 Zo Z5 24 
count 12 12 I2 12 12 12 K? 


>>> df['Maths'].aggregate (['max', 'min']) 
max 24 

min 12 

Name: Maths, dtype: int64 


Note: We can also use the parameter axis with 
aggregate function. By default, the value of axis is zero, 
means columns. 


#Using the above statement with axis=0 gives 
th same Besult 


>>> df["'Maths'].aggregate (['max', 'min'],axis=0) 
max 24 
min il 


Name: Maths, dtype: int64 


#Total marks of Maths and Science obtained by 
each student. 


#Use sum() with axis=1 (Row-wise summation) 


>>> dft[['Maths', 'Science']]. 
aggregate ('sum',axis=1) 


O 43 
1 41 
2 33 
3 oy, 
4 Sie 
9 40 
6 42 
q 46 
8 a7 
2 oe 
10 39 
11 e 


dtype: int64 
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3.4 SORTING A DATAFRAME 





Sorting refers to the arrangement of data elements in 
a specified order, which can either be ascending or 
descending. Pandas provide sort_values() function to 
sort the data values of a DataFrame. The syntax of the 
function is as follows: 


Dalakrame.sort values (by, axis=0, ascending=lfrue) 


Here, a column list (by), axis arguments (O for rows 
and 1 for columns) and the order of sorting (ascending 
= False or True) are passed as arguments. By default, 
sorting is done on row indexes in ascending order. 

Consider a scenario, where the teacher is interested 
in arranging a list according to the names of the students 
or according to marks obtained in a particular subject. 
In such cases, sorting can be used to obtain the desired 
results. Following is the python code for sorting the data 
in the DataFrame created at program 3.1. 

To sort the entire data on the basis of attribute 
‘Name’, we use the following command: 


#By default, sorting is done in ascending order. 


>>> print(df.sort values (by=['Name']) ) 


Name UT Maths Science S.St Hindi» Eng 


6 Ashravy 1 23 L3 20 15 d2 
7 Ashravy P 24 22 24 17 21 
8 Ashravy 3 12 25 h Za 23 
9 Mishti 1 15 22 25 22 22 
10 Mishti z 18 Al 25 24 23 
LI Mishti 3 17 18 ZX) 25 20 
0 Raman 1 ZL 21 18 20 24 
1 Raman 2 ZA ag 177 ZZ 24 
2 Raman 3 14 19 15 24 L3 
3 Zuhaire 1 20 L7 da 24 T9 
4 Zuhaire 2 7 3 15 21 Zo 19 
5 Zuhaire 3 AZ 18 19 23 13 


Now, to obtain sorted list of marks scored by all 
students in Science in Unit Test 2, the following code 
can be used: 

# Get the data corresponding to Unit Test 2 

>>> dafUT2 = dfl[df.UT == 2] 


# Sort according to ascending order of marks in 
Science 
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>>> Prane(GLUl2 sort: Values (by=[" science” ]),) 


Name UT Maths Science S.St Hindi Eng 


4 Zuhaire Z 23 15 21 Zo Lo 
1 Raman Z 21 20 17 Ae, 24 
10 Mishti 2 18 21 25 24 23 
7 Ashravy 2 24 ZZ 24 ice Zs 


Program 3-9 Write the statement which will sort the 
marks in English in the DataFrame df 
based on Unit Test 3, in descending order. 

# Get the data corresponding to Unit Test 3 
>>> dfUT3 = df[df.UT == 3] 


# Sort according to descending order of marks in 


Science 
>>> print (dfUT3.so@t Walues (by=[‘'Engy' |] Sagtemding=F 
alse) ) 

Name UT Maths Science S.St Hindi Eng 
2 Raman $ 14 13 | 24 72 
8 Ashravy 3 12 Do 19 2i 23 
11 Mishti 3 17 18 20 ZO 20 
5 Zuhaire 3 2a 18 19 Ae lo 


A DataFrame can be sorted based on multiple 
columns. Following is the code of sorting the DataFrame 
df based on marks in Science in Unit Test 3 in ascending 
order. If marks in Science are the same, then sorting 
will be done on the basis of marks in Hindi. 

# Get the data corresponding to marks in Unit Test 
3 
>> Mee emo dfn. UT == 3] 


# Sort the data according to Science and then 
according to Hindi 


>77 Brine (adLUl3S.SOre 


values (by=['Science', 'Hindi']) ) 


Name UT Maths Science S.St Hindi Eng 


3 Zuhaire 3 22 18 19 ae L3 
11 Mishti a LJ 18 20 25 20 
2 Raman 3 14 19 15 24 23 
8 Ashravy 3 L2 25 19 Al 23 


Here, we can see that the list is sorted on the basis 
of marks in Science. Two students namely, Zuhaire and 
Mishti have equal marks (18) in Science. Therefore for 
them, sorting is done on the basis of marks in Hindi. 
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3.5 GROUP BY FuncrTions NOTES 


In pandas, DataFrame.GROUP BY() function is used 
to split the data into groups based on some criteria. 
Pandas objects like a DataFrame can be split on any 
of their axes. The GROUP BY function works based on 
a split-apply-combine strategy which is shown below 
using a 3-step process: 


Step 1: Split the data into groups by creating a GROUP 
BY object from the original DataFrame. 


Step 2: Apply the required function. 


Step 3: Combine the results to form a new DataFrame. 


To understand this better, let us consider the data 
shown in the diagram given below. Here, we have a two- 
column DataFrame (key, data). We need to find the sum 
of the data column for a particular key, i.e. sum of all 
the data elements with key A, B and C, respectively. To 
do so, we first split the entire DataFrame into groups 
by key column. Then, we apply the sum function on the 
respective groups. Finally, we combine the results to 
form a new DataFrame that contains the desired result. 


split Apply Combine 
key data 





Sum 





Figure 3.1: A DataFrame with two columns 


The following statements show how to apply GROUP 
BY() function on our DataFrame df created at Program 
34: 


#Create a GROUP BY Name of the student from 
DataFrame df 


>>> gl=df.GROUP BY ('Name') 
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NOTES 


#Displaying the first entry from each group 
>>> Ol girs t.() 
UT Maths Science S$.St Hindi Eng 


Name 

Ashravy 1 L3 19 20 15 Ze 
Mishti il 15 22 25 22 22 
Raman 1 da 21 18 20 ZA. 
Zuhaire 1 AQ 17 Ze 24 19 


#Displaying the size of each group 
>>> gl.size() 

Name 

Ashravy 3 

Mishti 3 

Raman 3 

Zuhaire 3 

dtype: int64 


#Displa g group data, lige Wagueup name, row 
indexes corresponding to the group and their 
data type 


>> 91 .grðups 


{'Ashravy': Into4index([6, 7, 8], 
dtype='into4'), 


"Mishti': Inte4index([9, 10, 11], 
dtype='into4'), 


"Raman': Inte4Iindex([0, 1, 2], dtype='int64'), 


"Zuhaire': Into4Iindex([3, 4, 5], 
dtype='int64") } 


#Printing data of a single group 
>>> gl.get group ('Raman') 
UT Maths Science S.St Hindi Eng 


O i 22 al 18 20 Al 
1 2 al 20 L3 LA 24 
Z e 14 19 La 24 Z3 


#Grouping with respect to multiple attributes 
#Creating a GROUP BY Name and UT 


>>> g2=df.GROUP BY(['Name', 'UT']) 


Poe G2.Nrer() 
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Maths Science S.st Hindi Eng NOTES 
Name UT 
Ashravy 1 23 19 20 15 22 
2 24 22 24 L7 21 
5 12 2 19 ZA. 722 
Mishta 1 Re. dd 25 AA 27 
2 18 21 25 24 A3 
3 17 18 ZN) 25 20 
Raman ii Ze 2q 18 20 zA 
2 af. 20 L7 22 24 
3 14 193 15 24 22 
Zuhaire 1 AQ 177 R2 24 Ly 
Z ZO Lo 21 25 Rs 
3 22 18 19 23 is: 


The above statements show how we create groups by 
splitting a DataFrame using GROUP BY(). Next step is 
to apply functions over the groups just created. This is 
done using Aggregation. 

Aggregation is a process in which an aggregate 
function is applied on each group created by GROUP 
BY(). It returns a single aggregated statistical value 
corresponding to each group. It can be used to apply 
multiple functions over an axis. Be default, functions 
are applied over columns. Aggregation can be performed 
using agg() or aggregate() function. 


#Calculating average marks scored by all 
students in each subject for each UT 


>>> df.GROUP BY(['UT']) .aggregate ('mean') 
Maths Science S. Sd, gae Eng 

UT 

1 20 00 19.75 21.%™®ẹ»=£0.25 21.00 

2 41 50 Lo. Diet TO UZ OO 20a 

3 IG. 25 20.80 B8B.25 23.25 19.75 


#Calculate average marks scored in Maths in 
each UT 


>>> groupl=df.GROUP BY(['UT']) 
>>> groupl|['Maths'].aggregate('mean') 


UE 

1 Aa O 
2 gO) 
2 165.20 


Name: Maths, dtype: floato4 





Activity 3.5 > 


Write the python 
statements to print 
average marks in 
Science by all the 
students in each UT. 
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Program 3-10 Write the python statements to print the 
mean, variance, standard deviation and 
quartile of the marks scored in Mathematics 
by each student across the UTs. 


>>> df.GROUP BY (by='Name') [|'Maths'].agg(['mean', 'v 
ar', 'std', 'quantile']) 


mean var Seca quantile 
Name 
Ashravy 19.666667 44.333333 6.658328 23 gle 
Mishti 160:06006007 Deo OD Leow fo 17.0 
Raman 19.000000 19.000000 4.358899 2 TRO 
Zuhaire21.666667 fed ooo dete too aa, O 


3.6 ALTERING THE INDEX 


We use indexing to access the elements of a DataFrame. 
It is used for fast retrieval of data. By default, a numeric 
index starting from O is created as a row index, as shown 
below: 


So di #With default Index 
Name UT Maths Science S.St Hindi Eng 
O Raman L 22 21 18 20 21 
d Raman Z AT 20 17 2 24 
2 Raman 3 14 19 Lo 24 A3 
3 Zuhaire 1 20 17 22 24 19 
4 Zuhaire 2 ie 15 al ine 15 
z Zuhaire i AZ 18 19 A 13 
6 Ashravy 1 23 19 20 ie) 22 
7 Ashravy 2 24 22 24 17 2l 
8 Ashravy 3 L2 An 19 ZA. 22 
9 Mig@ti 1 15 A2 a ae 22 
10 Mishti 2 18 21 Zo 24 23 
11 Mishti 3 17 18 20 wae. 20 


Here, the integer number in the first column 
starting from O is the index. However, depending on our 
requirements, we can select some other column to be 
the index or we can add another index column. 

When we slice the data, we get the original index 
which is not continuous, e.g. when we select marks of 
all students in Unit Test 1, we get the following result: 


>>> dfUT1 = df[df.UT == 1] 
>>> print (d£UT1) 
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Name UT Maths Science S$.St Hindi Eng 


O Raman il Ld Al 18 20 21 
3 Zuhaire 1 20 17 22 24 r9 
6 Ashravy 1 e 19 20 15 22 
9 Misti 1 15 22 Zz 22 22 


Notice that the first column is a non-continuous 
index since it is slicing of original data. We create a new 
continuous index alongside this using the reset_index() 
function, as shown below: 

27> dTUTI:freseC index(inplace=[rue) 
>>> print (d£UT1) 


index Name UT Maths Science S.St Hindi Eng 
O O Raman 1 22 21 18 20 21 
1 3 Zuhaire 1 20 1 Aa 24 19 
Z 6 Ashravy dl no Lg AO 15 2A 
3 9 Mishti 1 15 gh 25 22 2 


A new continuous index is created while the original 
one is also intact. We can drop the original index by 
using the drop function, as shown below: 


>>> dfUT1.drop(columns=[ ‘index’ ],inplace=True) 
>> Print (dr UTL) 


Name UT Maths» Science S.St Hindi Eng 


O Raman 1 22 21 18 20) 21 
1 Zuhaire 1 AQ 17 g2 24 19 
2 Ashravy 1 Zo 19 20 iig ae. 
3 Mishti 1 ies: Zz ee AZ 22 


We can change the index to some other column of 
the data. 
>>> dfUT1.set index ('Mẹhe',®%nplace=True) 
>>> print (dfUT1) 
UT Maths Science $.St Hindi Eng 


Name 

Raman 1 2 A Al 18 20 21 
Zuhaire 1 20 JRE, Ze 24 19 
Ashravy 1 23 19 20 15 Ze 
Mishti 1 15 22 Zo 22 22 


We can revert back to previous index by using 
following statement: 


>>> dfUTl.reset index('Name', inplace = True) 
>>> Prin (HUT) 
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Name UT Maths Science S.St Hindi Eng 


O Raman ii 22 21 18 20 ZA. 
1 Zuhaire i 20 17 LL 24 LY 
2 Ashravy 1 Zo 19 20 ies 22 
3 Mishti 1 is 22 25 A2 22 


3.7 OTHER DATAFRAME OPERATIONS 


In this section, we will learn more techniques and 
functions that can be used to manipulate and analyse 
data in a DataFrame. 


3.7.1 Reshaping Data 


The way a dataset is arranged into rows and columns is 
referred to as the shape of data. Reshaping data refers 
to the process of changing the shape of the dataset 
to make it suitable for some analysis problems. The 
example given in the below section explains the utility 
of reshaping the data. 

For reshaping data, two basic functions are available 
in Pandas, pivot and pivot_table. This section covers 
them in detail. 

(A) Pivot 

The pivot function is used to reshape and create a new 
DataFrame from the original one. Consider the following 
example of sales and profit data of four stores: S1, S2, 
S3 and S4 for the years 2016, 2017 and 2018. 


Example 3.1 
>>> import pandas as pd 


>a Ree 'Store':['S1', 'S4', 'S3','S1', 'S2','S3 
fom, O25 "So ),. "Year"? (2016,2016,2016, 2017 
mel, 2017,2018,2018,2018], 


"Total Sales (Rs) [120007320000 420000, 
20000,10000, 450000, 30000, 11000,89000], 


"Total profit ( 
RS) 7 [T100;,5500,21000, 52000, 9000,45000, 5000, 
1L900,.23000 | 


} 


>>> df=pd.DataFrame (data) 
>>> Prine (ar) 


Store Year Total sales (Rs) Total profit (Rs) 
O ol 2016 12000 1100 
i S4 2016 330000 5500 


2016 420000 21000 
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z S1 Z017 20000 32000 
4 ga 2017 10000 9000 
z oe v0 pil, 450000 45000 
6 al 2018 30000 3000 
7 2 201g 11000 t900 
8 ok 2018 89000 23000 


Let us try to answer the following queries on the 
above data. 


1) What was the total sale of store S1 in all the years? 
Python statements to perform this task will be 
as follows: 


# will get the data related to store S1 
>>> Sidf = df[df.Store=="S1" ] 
#find the total of sales for Store S1 
>>> oldi [Total sales (R9) ]-sum() 
62000 


2) What is the maximum sale value by store S3 in 
any year? 


#will get the data related to store S3 
>>> S3df = df[df.Store==" $37 ] 

#find the maximum sale for Store S3 

>>> S3df[ ‘Total saldea W. max () 
450000 


3) Which store had the maximum total sale in all 
the years? 


>>> Sldf = df Ted Store=='S1 7f] 

>>> S2df=dt[df.Store == 'S2"] 

>>> S3df = df[df.Store=='S3'] 

>>> S4df = dft[df.Store=='S4"'] 

>>> Sltotal = Sldf['TotaNgsales (RS)"] .sum() 

>>> SZtotal = S2@rPRMhotal sales(Rs)'].sum() 
oo > SOLOLaL o3dm[‘'Total sales (Rs)'].sum() 
>>> S4totem = >I ['Total sales(Rs) '].sum() 

>>> max (Sltotai@e2total,S3total,S4total) 

959000 


Notice that we have to slice the data corresponding to 
a particular store and then answer the query. Now, let 
us reshape the data using pivot and see the difference. 
>>> 


pivotl=df.pivot (index='Store',columns='Year',va 
lūües=" Total sales tks) ) 





Activity 3.6 <a 





Consider the data of 
unit test marks given 
at program 3.1, write 
the python statements 
to print name wise UT 
marks in mathematics. 
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Here, Index specifies the columns that will be acting 
as an index in the pivot table, columns specifies the 
new columns for the pivoted data and values specifies 
columns whose values will be displayed. In this 
particular case, store names will act as index, year 
will be the headers for columns and sales value will be 
displayed as values of the pivot table. 


>>> Print (plvoul) 


Year 2016 2017 2018 
Store 

efi T2000; 0 2000040 30000.0 
52 NaN 10000.0 11000.0 
Da 420000.0 450000.0 89000.0 
S4 330000-0 NaN NaN 


As can be seen above, the value of Total_sales (Rs) 
for every row in the original table has been transferred 
to the new table: pivotl, where each row has data of a 
store and each column has data of a year. Those cells in 
the new pivot table which do not have a matching entry 
in the original one are filled with NaN. For instance, we 
did not have values corresponding to sales of Store S2 
in 2016, thus the appropriate cell in pivot! is filled with 
NaN. 


Now the python statements for the above queries will 
be as follows: 


1) What was the total sale of store S1 in all the years? 
>>> pivot Maloc[‘S1’].sum() 


2) What is the maximum sale value by store $3 in 
any year? 
ee eemevotl.loc[‘S3’ ].max() 


3) Which store had the maximum total sale? 
>>> Sltotal = pivotl.loc['S1'].sum() 
>>> DAtOtal = pavorl. hoc "S62" )]<suam() 
>>> D3tOtal = pavyoul,.toc[*52').sum() 
>>> S4total = pivotl.loc['S4'].sum() 
>>> max(Sltotal;Sz2total;S3totťtal;S4total) 


We can notice that reshaping has transformed the 
structure of the data, which makes it more readable 
and easy to analyse the data. 


(B) Pivoting by Multiple Columns 
For pivoting by multiple columns, we need to specify 
multiple column names in the values parameter of 
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pivot() function. If we omit the values parameter, it will 
display the pivoting for all the numeric values. 





>>> pivot2=df.pivot (index='Store',columns='Year 
', velues=(["Toral. Sales (ks) y Total prouc(Rs) I) 


ao print (p1vocZ) 


Total sales (Ks) Total profit (Rs) 
year 2016 2017 2018 ZOLL 2017 2018 
>LCOre 
S1 12000,0 200000 30000%0 110030 32000 40 5000.0 
S2 NaN LO000.0° 1100040 NaN 9000.0 1900.0 
S3 330000-0 NaN NaN gaio orao NaN NaN 


Let us consider another example, where suppose we 
have stock data corresponding to a store as: 

>>> data={'Item':['Pen', 'Pen', 'Pencil','Pencil' 

,'Pen', 'Pen'], 

‘Color':['Red','Red!', 'Black', 'Blackah, 'Blue','B 

lue'], 

Price (RS) "i PbO, 25, 775, 007-2019 

‘Units in stock':[50,10, 47/934, 55,14] 

} 

>>> df=pd.DataFrame (data) 

>>> prance (as) 


Item Color Prę®e (KRW Units im sick 
O Pen Red 10 pO 
1 Pen Red 25 10 
2 Pencil Black J 47 
3 Pencil Black 3 34 
4 Pen Blue 50 eis 
z Pen Blue LN 14 


Now, let us assume, we have to reshape the above 
table with Item as the index and Color as the column. 
We will use pivot function as given below: 

>>> pivot3=df.pivot (index='Item',columns='Color 
',values='Un i in Stock') 

But this statement results in an error: “ValueError: 
Index contains duplicate entries, cannot reshape”. This 
is because duplicate data can’t be reshaped using pivot 
function. Hence, before calling the pivot() function, we 
need to ensure that our data do not have rows with 
duplicate values for the specified columns. If we can’t 
ensure this, we may have to use pivot_table function 
instead. 
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sum 
COLOr Black Blue 
Item 
Pen NaN G90 
Pencil 81.0 NaN 


(C) Pivot Table 

It works like a pivot function, but aggregates the values 
from rows with duplicate entries for the specified 
columns. In other words, we can use aggregate functions 
like min, max, mean etc, wherever we have duplicate 
entries. The default aggregate function is mean. 


Syntax: 

pandas.pivot table(data, values=None, 

index=None, columns=None, aggfunc='mean') 

The parameter aggfunc can have values among sum, 
max, min, len, np.mean, np.median. 

We can apply index to multiple columns if we don't 
have any unique column to act as index. 


>>> dfl = dfi.piyy 
table (index=['Item','Color']) 
>>> print (digg 


Price (Rs) Units 1M st@ek 

Item cC or 
Pen Blue Joay 34.5 
Red ilo 3040 
Pencil Black 6.0 40.5 


Please note that mean has been used as the default 
aggregate function. Price of the blue pen in the original 
data is 50 and 20. Mean has been used as aggregate 
and the price of the blue pen is 35 in dfl. 

We can use multiple aggregate functions on the 
data. Below example shows the use of the sum, max 
and np.mean function. 


>>> Sawotgtablel=df.pivot table (index=' 
Pew, Celumns="Color',values='Units in_ 
stock!',aggfunc=[sum,max,np.mean] ) 


27 Pivot Lables 


max mean 
Red Black Blue Red Black Blue Red 


60.0 NaN oe EPAI NaN 34.5 30.0 

NaN 47.0 NaN NaN 40.5 NaN NaN 

Pivoting can also be done on multiple columns. 
Further, different aggregate functions can be applied on 
different columns. The following example demonstrates 
pivoting on two columns - Price(Rs) and Units_in_stock. 
Also, the application of len() function on the column 
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Price(Rs) and mean() function of column Units_in_ NOTES 
stock is shown in the example. Note that the aggregate 

function len returns the number of rows corresponding 

to that entry. 


2> pivot taplel=dr.pivot tapole(index='Item" 
,cOlLumniSs="Color",;velucs=(["*Prace(Rhs) *,; Une 
in SLOCK" | ,eqgrunC={"Prace( ks)”: len, "Unies 1n 
stock":np.mean}) 





>>> pivot tablel 


Price (Rs) Units in stock 
Color Black Blue Red Black Blue Red 
Item 
Pen NaN 2.0 2.0 NaN 34.5 30.0 
Pencil 2.0 NaN NaN 40.5 NaN NaN 


Program 3-11 Write the statement to print the maximum 
price of pen of each color. 


>>> dfpen=df[df.Item=='Pen'] 


>>> pivot redpen=dfpen.pivot table (index='Item' 
,columns=['Color'],values=['Price(Rs)'],aggfun 
c=[max] ) 


>>> print (pivot redpen) 


max 
Price (Rs) 

Color Bluc». Red 

Lcem 

Pen a0 S2 


3.8 HANDLING Miıssıne VALUES 


As we know that a DataFrame can consist of many rows 
(objects) where each row can have values for various 
columns (attributes). If a value corresponding to a 
column is not present, it is considered to be a missing 
value. A missing value is denoted by NaN. 

In the real world dataset, it is common for an object 
to have some missing attributes. There may be several 
reasons for that. In some cases, data was not collected 
properly resulting in missing data e.g some people did 
not fill all the fields while taking the survey. Sometimes, 
some attributes are not relevant to all. For example, if 
a person is unemployed then salary attribute will be 
irrelevant and hence may not have been filled up. 
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NOTES 


Missing values create a lot of problems during data 
analysis and have to be handled properly. The two 
most common strategies for handling missing values 
explained in this section are: 


i) drop the object having missing values, 

ii) fll or estimate the missing value 

Let us refer to the previous case study given at table 
3.1. Suppose, the students have now appeared for 
Unit Test 4 also. But, Raman could not appear for the 
Science, Maths and English tests, and suppose there 
is no possibility of a re-test. Therefore, marks obtained 
by him corresponding to these subjects will be missing. 
The dataset after Unit Test 4 is as shown at Table 3.2. 
Note that the attributes ‘Science, Maths’ and English’ 
have missing values in Unit Test 4 for Raman. 


Table 3.2 Case study data after UT4 


Name/ Unit 
Subjects | Test 
1 22 2il 18 20 21 





Raman 

Raman 2 2l 20 7, 22 24 
Raman 3 14 19 15 24 29 
Raman 4 19 18 

Zuhaire 1 20 17 22 24 19 
Zuhaire 2 23 15 21 25 I5 
Zuhaire b 22 18 19 DKS) I3 
Zuhaire 4 19 20 7 19 16 
Aashravy 1 23 19 20 15 22 
Aashravy 2 24 272. 24 17 2l 
Aashravy 3 LZ 25 19 21 23 
Aashravy 4 15 20 20 20 I7 
Mishti 1 15 22 25 22 22 
Mishti 2 18 21 29 24 25 
Mishti 3 17 18 20 25 20 
Mishti 4. 14 20 19 20 18 


To calculate the final result, teachers are asked to 
submit the percentage of marks obtained by all students. 
In the case of Raman, the Maths teacher decides to 
compute the marks obtained in 3 tests and then find the 
percentage of marks from the total score of 75 marks. 
In a way, she decides to drop the marks of Unit Test 4. 
However, the English teacher decides to give the same 
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marks to Raman in the 4th test as scored in the 3rd 
test. Science teacher decides to give Raman zero marks 
in the 4th test and then computes the percentage of 
marks obtained. Following sections explain the code 
for checking missing values and the code for replacing 
those missing values with appropriate values. 





3.8.1 Checking Missing Values 


Pandas provide a function isnull() to check whether any 

value is missing or not in the DataFrame. This function 

checks all attributes and returns True in case that 

attribute has missing values, otherwise returns False. 
The following code stores the data of marks of all 

the Unit Tests in a DataFrame and checks whether the 

DataFrame has missing values or not. 
>>> marksUT = { 
'Name': ['Raman', 'Raman', 'Raman', 'Raman', 'Zuhaire', 'Zuhaire', 'Zuhaire' 
,'Zuhaire', 'Ashravy', 'Ashravy', 'Ashravy', 'Ashravy', 'Mishti','Mishti', 
Michi) "Mishra. Jy 
'UT': [1,2,3,4,1,2,3,4,1,2,3,47ay &,A4l3 
'Maths':[22,21,14,np.NaN, 203, W, 199@23,24,12,159h5,45,17,141, 
'Science':[21,20,19,np.NaW, 17,15, %8 20,19,22 @5, 2072, 21,18,20], 
'S.St':[18,17,15,19,22 a21 %®9,17 40,24,19, Peep , 220,19), 
"Hindi':[20,22,24,18, 2 M> "km, 15,17, OB, 20 W@2,24,25,20], 
'Eng':[21,24,23,np. Ame] Mao, 13,16,22 man A, 22,23,20,18] } 


>>> df = pd.DataFrame (marksUT) 


>27 Princ (dt.isnult()) 


Output of the above code will be 


Name UT Maths Science Seow hindi Eng 
O False False False False False False False 
1 False False False False False False False 
2 False False False False False False False 
3 False False True True False False True 
4 False False Fakse False False False False 
5 False False False False False False False 
6 False False False False False False False 
7 False False False False False False False 
8 False False False False False False False 
9 False False False False False False False 
10 False False False False False False False 
11 False False False False False False False 
12 False False False False False False False 
13 False False False False False False False 
14 False False False False False False False 
15 False False False False False False False 
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NOTES One can check for each individual attribute also, 
e.g. the following statement checks whether attribute 
‘Science’ has a missing value or not. It returns True for 
each row where there is a missing value for attribute 
‘Science’, and False otherwise. 
>>> print(dft['Science'].isnull()) 





O False 
1 False 
2 False 
3 True 
4 False 
5 False 
6 False 
T False 
8 False 
9 False 
10 False 
11 False 
La False 
IG False 
14 Fale 
RS False 


Name: Science, dtype: bool 


To check whether a column (attribute) has a missing 
value in the entire dataset, any() function is used. It 
returns True in case of missing value else returns False. 


>>> pers (df.isnull().any()) 


Name False 
Ue False 
Maths True 
Science True 
Deo False 
Hindi False 
Eng True 


dtype: bool 


The function any() can be used for a particular 
attribute also. The following statements) returns True 
in case an attribute has a missing value else it returns 
False. 


>>> print (df['Science'].isnull().any()) 


True 
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eo prince(de | hinds’ |).2snull{) «any ()) NOTES 


False 


To find the number of NaN values corresponding to 
each attribute, one can use the sum() function along 
with isnull() function, as shown below: 

>>> print(di.isnull().sum()} 

Name O 

UT 0 

Maths 1 

science 1 

Dao O 

Hindi O 

Eng ii 

dtype: int64 


To find the total number of NaN in the whole dataset, 
one can use df.isnull().sum().sum(). 

>>> print(df.isnull().sum().sum() ) 

3 


Program 3-12 Write a program to find the percentage of 
marks scored by Raman in hindi. 


>>> dfRaman = df[df['Name! ]=='Raman' | 
>>> print('Marks Scored by Raman \n\n',dfRaman) 


Marks Scored by Raman 
Name UT Maths Science S.st Hindi Eng 


O Raman 1 2%. D 210 18 20 2lU 
1 Raman 2 Pgo 20.6 17 22. £4,0 
2 Raman 3 14.0 19.0 15 24 Zeal’ 
3 Raman 4 NaN NaN 19 18 NaN 


>>> dfHindi = dfRaman['Hindi'] 


>>> print ("MarksgsSC@ked by Raman in Hindi 
Vin \n dL Nindi) 


Marks Scored by Raman in Hindi 


O 20 
L 22 
2 24 
3 18 


Name: Hindi, dtype: int64 


>>> row = len(dfHindi) # Number of Unit Tests 
held. Here row will be 4 
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NOTES 


>>> print ("Percentage of Marks Scored by Raman 
in Hindi nn", (dal Hindissum()*100)/ (257 row); "3") 


# denominator in the above formula represents 
the aggregate of marks of all tests. Here row 
is 4 tests and 25 is maximum marks for one test 


Percentage of Marks Scored by Raman in Hindi 
84.0 % 


Program 3-13 Write a python program to find the 
percentage of marks obtained by Raman 
in Maths subject. 


>>> dfMaths = dfRaman['Maths'] 


>>> print("Marks Scored by Raman in Maths 
\n\n", dfMaths) 


Marks Scored by Raman in Maths 


O 22 ð 
1 2 QO 
2 14.0 
3 NaN 


Name: Maths, dtype: float64 


>>> row = len(dfMaths) # here, row will be 4, 
the number of Unit Tests 


>>> print("Percentage of Marks Scored by Raman 
in Maths\nW@, dfMaths.sum()*100/ (25*row) ,"%3") 


Percentage of Marks Scored by Raman in Maths 

9ra 

Here, notice that Raman was absent in Unit Test 4 in 
Maths Subject. While computing the percentage, marks 
of the fourth test have been considered as O. 


3.8.2 Dropping Missing Values 


Missing values can be handled by either dropping the 
entire row having missing value or replacing it with 
appropriate value. 

Dropping will remove the entire row (object) having 
the missing value(s). This strategy reduces the size of 
the dataset used in data analysis, hence should be used 
in case of missing values on few objects. The dropna() 
function can be used to drop an entire row from the 
DataFrame. For example, calling dropna() function on 
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the previous example will remove the 4th row having 
NaN value. 





>>> dfl = di.dropna() 
>>> Print (ar) 
Name UT Maths Science S.Sst Hindi Eng 


O Raman 1 22a) 2 wl) 18 20 21.0 
il Raman 2 Adie 20.0 17 22 24.0 
2 Raman 3 14.0 190 IRE 24 23.0 
4 Zuhaire 1 AU L70 oe 24 19.0 
3 Zuhaire r Zoe) 15.0 al 2o 10 
6 Zuhaire > ZA a) Oa 19 Ao hore 
7 Zuhaire 4 T940 Zo) 17 AA. “WIG. 4) 
8 Ashravy ii 2540 Egle) 20 LS 220 
9 Ashravy Z 24.0 ZA%0 24 17 21.0 
10 Ashravy 3 LU Zon 19 kL 20 
11 Ashravy 4 1540 20 eU PO 2m 17.0 
12 Mishti 1 Loe 22.40 R5 g2 LLQ 
To Mishti 2 osQ 21.0 25 24 23.0 
14 Mishti 3 ie ee, 18 % 20 25 2Q.0 
Re Mash Ca: 4 14.0 ZOO 19 20 18% 


Now, let us consider the following code: 
# marks obtained by Raman in all the unit tests 
>>> dfRaman=dt[df.Name=='Raman" | 


# inplace=true makes changes in the #original 
DataFrame i.e. dfRaman #here 


>>> dfRaman.dropna(inplace=True, how='any') 


>>> dfMaths = dfRaman['Maths'] # get the marks 
scored in Maths 


>>> print("\nMarks Scored by Raman in Maths 
\n",dfMaths) 


Marks Scored by Raman in Maths 


O Boa 
1 210 
2 14.0 
3 NaN 


Name: Maths, dtype: floato4 


>>> row = len(dfMaths) 


>>> print("\nPercentage of Marks Scored by 
Raman in Maths\n") 


>>> print (dfMaths.sum() *100/ (25*row) ,"%") 
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NOTES 


Percentage of Marks Scored by Raman in Maths 
TOU z% 


Note that the number of rows in dfRaman is 3 after 
using dropna. Hence percentage is computed from 
marks obtained in 3 Unit Tests. 


3.8.3 Estimating Missing Values 


Missing values can be filled by using estimations or 
approximations e.g a value just before (or after) the 
missing value, average/minimum/maximum of the 
values of that attribute, etc. In some cases, missing 
values are replaced by zeros (or ones). 

The fillna(num) function can be used to replace 
missing value(s) by the value specified in num. For 
example, fillna(O) replaces missing value by O. Similarly 
fillna(1) replaces missing value by 1. Following code 
replaces missing values by O and computes the 
percentage of marks scored by Raman in Science. 


#Marks Scored by Raman in all the subjects 
acros® the tests 


>>> dfRaman = df.loc[df['Name' |]=='Raman' | 
>>> (row,col) = dfRaman.shape 
>>> dfScience = dfRaman.loc[:,'Science'] 


a>> print ("Marke Sd@ored by Raman in Science 
\n\n", dfSc ®ance} 


Marks Scored by Raman in Science 


21.0 
20.0 
LISU 
NaN 
Name: Science, dtype: float64 


O N ==> 


>>> dfFillZeroScience = dfScience.fillna (0) 


>>> print('\nMarks Scored by Raman in Science 
with Missing Values Replaced with Zero\ 
n',dfFillZeroScience) 


Marks Scored by Raman in Science with Missing 
Values Replaced with Zero 


O 2l ©) 
1 2040 
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2 19.0 NOTES 
3 OU 
Name: Science, dtype: float64 


>>> print ("Percentage of Marks Scored by Raman 
in Science\n\n",dfFillZeroScience.sum()*100/ 
(25*roOw) "3" 


Percentage of Marks Scored by Raman in Science 

60.0 % 

df.fillna(method='pad') replaces the missing 
value by the value before the missing value while 
df.fillna(method='bfill') replaces the missing value by the 
value after the missing value. Following code replaces 
the missing value in Unit Test 4 of English test by the 
marks of Unit Test 3 and then computes the percentage 
of marks obtained by Raman. 

>>> GLEngG = diRaman, loc|:, Eng] 

>>> print ("Marks Scored by Raman»in English 

Vp \n",dafhmg) 


Marks Scored by Raman in English 


O gla ©) 
1 24.0 
2 3U 
3 NaN 


Name: Eng, dtype: floato4 


>>> dfFillPadEng = dfEng.fillna (method='pad') 


>>> print('\nMarks Scored by Raman in English 
with Missing Values Replaced by Previous Test 
Marks\n',dfFillPadEng) 


Marks Scored by Raman in English with Missing 
Values Replaced by Previous Test Marks 


O 2 Mig 
1 24.0 
2 230 
3 Zo) 


Name: Eng, dtype: floato4 


>>> print("Percentage of Marks Scored by Raman 
in English\n\n") 


>>> @rirmya (dfFillPadEng.sum() *100/ (25*row) ,"%3") 


Percentage of Marks Scored by Raman in English 

91 . OR 

In this section, we have discussed various ways 
of handling missing values. Missing value is loss of 
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information and replacing missing values by some 
estimation will surely change the dataset. In all cases, 
data analysis results will not be actual results but will 
be a good approximation of actual results. 


3.9 IMPORT AND EXPORT OF DATA BETWEEN PANDAS 
AND MYSQL 


So far, we have directly entered data and created 
a DataFrame and learned how to analyse data in a 
DataFrame. However, in actual scenarios, data need 
not be typed or copy pasted everytime. Rather, data is 
available most of the time in a file (text or csv) or in 
a database. Thus, in real-world scenarios, we will be 
required to bring data directly from a database and load 
to a DataFrame. This is called importing data from a 
database. Likewise, after analysis, we will be required to 
store data back to a database. This is called exporting 
data to a database. 

Data from DataFrame can be read from and written 
to MySQL database. To do this, a connection is required 
with the MySQL database using the pymysql database 
driver. And for this, the driver should be installed in the 
python environment using the following command: 


pip install pymysql 


sqlalchemy is a library used to interact with the 
MySQL database by providing the required credentials. 
This library can be installed using the following 
command: 
pip install sqlalchemy 


Once it is installed, sqlalchemy provides a function 
create_engine() that enables this connection to be 
established. The string inside the function is known as 
connection string. The connection string is composed of 
multiple parameters like the name of the database with 
which we want to establish the connection, username, 
password, host, port number and finally the name of 
the database. And, this function returns an engine 
object based on this connection string. The syntax for 
the same is discussed below: 

engine=create engine('driver:// 
username:password@host:port/name of _ 
database',index=false) 
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where, 

Driver = mysql+pymysql 

username=User name of the mysql (normally it is root) 
password= Password of the MySql 

port = usually we connect to localhost with port number 
3306 (Default port number) 

Name of the Database = Your database 


In the following subsections, importing and exporting 
data between Pandas and MySQL applications are 
demonstrated. For this, we will use the same database 
CARSHOWROOM and Table INVENTORY created in 
Chapter 1 of this book. 

mysql> use CARSHOWROOM ; 


Database changed 
mysql> select * from INVENTORY; 


+------- +-------- +----------- +----------- +----------------- +---------- + 
| CarId | CarName| Price | Model | YearManufacture | Fueltype | 
4------- +-------- +----------- +----------- +----------------- +---------- + 
| DOOL | Carl | 582613.00 | LXI | 2017 | Petrol | 
| DOO2 | Carl | 673112.00 | VXI | 2018 | Petrol | 
| BOOL | Car2 | 567031.00 | Sigmal.2 | 2019 | Petrol | 
| B002 | cCar2 | 647858.00 | Deltal.2 | 2018 | Petrol | 
| E001 | Car3 | 355205.00 | 5 STR STD | 2017 | CNG | 
| E002 | Car3 | 654914.00 | CARE | 2018 | CNG | 
| s001 | Car4 | 514000.00 | LXI | 2017 | Petrol | 
| s002 | car4 | 614000.00 | VXI | 2018 | Petrol | 
+------- +-------- +----------- +----------- +----------------- +---------- + 
8 rows in set (0.00 sec) 


3.9.1 Importing Data from MySQL to Pandas 


Importing data from MySQL to pandas basically refers 
to the process of reading a table from MySQL database 
and loading it to a pandas DataFrame. After establishing 
the connection, in order to fetch data from the table of 
the database we have the following three functions: 


1) pandas.read sql query(query,sql conn) 


It is used to read an sql query (query) into a 
DataFrame using the connection identifier (sql_ 
conn) returned from the create_engine (). 


2) pandas.read sql table(table name,sql_ conn) 


It is used to read an sql table (table_name) into a 
DataFrame using the connection identifier (sql_ 
conn). 


3) pandas.read sql(sql, sql conn) 


It is used to read either an sql query or an sql 
table (sql) into a DataFrame using the connection 
identifier (sql_conn). 
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>>> Import pandas as pa 

>>> import pymysql as py 

>>> import sqlalchemy 

>>> engine=create engine ('mysql+pymysql:// 
root:smsmb@localhost :3306/CARSHOWROOM') 
vor Of = pa.tead sql query (” SELECT * FROM 
INVENTORY', engine) 

>>> Print (CL) 


CarId CarName Price Model YearManufacture Fueltype 

O D001 Carl 30263-00 LXI 2017 Petrol 
1 D002 Car1 6/311200 VXI 201e Petrol 
2 BOO Car2 567031.00 Ssagmal.2 2019 Petrol 
> BOOZ Car2 647858.00 Deltal.2 2018 Petrol 
4 E001 Car3 239920900 5STR WD 2017 CNG 

5 BOOZ Car3 654914.00 CARE 2018 CNG 

6 S001 Car4 514000.00 LXI 2017 Petrol 
d S002 Car4 614000.00 VXI A016” Petrol 


3.9.2 Exporting Data from Pandas to MySQL 


Exporting data from Pandas to MySQL basically refers 
to the process of writing a pandas DataFrame to a table 
of MySQL database. For this purpose, we have the 
following function: 
pPagdas™PataFrame.t@ysqliftable,sql conn,if_ 
ex Neat s="fail”, giex: W Ssece/True) 

e Table specifies the name of the table in which we 
want to create or append DataFrame values. It is 
used to write the specified DataFrame to the table 
the connection identifier (sql_conn) returned from the 
create_engine (). 

e The parameter if_exists specifies “the way data from 
the DataFrame should be entered in the table. It 
can have the following three values: “fail”, “replace”, 
“append”. 

o “fail” is the default value that indicates a 
ValueError if the table already exists in the 
database. 


o “replace” specifies that the previous content of 
the table should be updated by the contents of 
the DataFrame. 


o “append” specifies that the contents of the 
DataFrame should be appended to the existing 
table and when updated the format must be the 
same (column name sequences). 
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e Index — By default index is True means DataFrame NOTES 
index will be copied to MySQL table. If False, then it 
will ignore the DataFrame indexing. 





#Code to write DataFrame df to database 


>>> import pandas as pd 
>>> import pymysgql as py 
>>> import sqlalchemy 


>>> engine=create engine ('mysql+pymysql:// 
root:smsmb@localhost: 3306/CARSHOWROOM'! ) 


>>> data={ 
"ShowRoomid!':[1,2,3,4,5], 


‘Location': [‘Delhi', 'Bangalore', 'Mumbai', 'Chand 
igarh', 'Kerala']} 


>>> df=pd.DataFrame (data) 


>>> df.to sql ('showroom info',engin E _ 
exists="replace", index=False) 


After running this python script, a mysql table 
with the name “showroom _ info” will be created in the 
database. 


SUMMARY 
ley OP FT 


e Descriptive Statistics are used to quantitatively 
summarise the given data. 


e Pandas provide many statistical functions for 
analysis of data. Some of the functions are max(), 
min(), mean(), median(), mode(), std(), var() etc. 


e Sorting is used to arrange data in a specified 
order, i.e. either ascending or descending. 


e Indexes or labels of a row or column can be 
changed in a DataFrame. This process is known 
as Altering the index. Two functions reset_index 
and set_index are used for that purpose. 


e Missing values are a hindrance in data analysis 
and must be handled properly. 


e There are primarily two main strategies for 
handling missing data. Either the row (or column) 
having missing value is removed completely from 
analysis or missing value is replaced by some 
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appropriate value (which may be zero or one or 


average etc.) 


e Process ofchanging the structure ofthe DataFrame 
is known as Reshaping. Pandas provide two basic 
functions for this, pivot() and pivot_table(). 


e pymysql and sqlalchemy are two mandatory 
libraries for facilitating import and export of data 
between Pandas and MySQL. Before import and 
export, a connection needs to be established from 
python script to MySQL database. 

e Importing data from MySQL to Panda refers to 
the process of fetching data from a MySQL table 
or database to a pandas DataFrame. 

e Exporting data from Pandas to MySQL refers to the 
process of storing data from a pandas DataFrame 
to a MySQL table or database. 


Exercise 





1. Write the statement to install the python connector to 
connect MySQL i.e. pymysal. 


2. Explain the difference between pivot() and pivot_ 
table() function? 

3. What is sqlalchemy? 

4. Can you sort a DataFrame with respect to multiple 
columns? 


5. What are missing values? What are the strategies to 
handle them? 


6. Define the following terms: Median, Standard 
Deviation and variance. 


7. What do you understand by the term MODE? Name 
the function which is used to calculate it. 


8. Write the purpose of Data aggregation. 


9. Explain the concept of GROUP BY with help on an 
example. 


10. Write the steps required to read data from a MySQL 
database to a DataFrame. 


11. Explain the importance of reshaping of data with an 
example. 
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12. Why estimation is an important concept in NOTES 
data analysis? 


13. Assuming the given table: Product. Write the python 
code for the following: 





rn] corny [term | 

TV LG 12000 700 
TV VIDEOCON 10000 650 
TV LG 15000 800 
AC SONY 14000 750 

a) To create the data frame for the above table. 

b) To add the new rows in the data frame. 

c) To display the maximum price of LG TV. 

d) To display the Sum of all products. 

e) To display the median of the USD of Sony 


products. 


f| To sort the data according to the Rupees and 
transfer the data to MySQL. 


g) To transfer the new dataframe into the MySQL 
with new values. 


14. Write the python statement for the following question 
on the basis of given dataset: 


Name Degree score 


O Aparna MBA 90.3.0 
1 Pankaj BCA NaN 
2 Ram M.Tech 80.0 
3 Ramesh MBA 98.0 
= Naveen NaN 97.0 
95 Krrishnav BCA 78.0 
6 Bhawna MBA 09.0 


a) To create the above DataFrame. 


b) To print the Degree and maximum marks in each 
stream. 


c) To fill the NaN with 76. 
d) To set the index to Name. 


e) To display the name and degree wise average 
marks of each student. 


f) To count the number of students in MBA. 


g) To print the mode marks BCA. 





104 INFORMATICS PRACTICES 


NOTES 


SOLVED CASE STUDY BASED ON OPEN DATASETS 

UCI dataset is a collection of open datasets, available 
to the public for experimentation and research 
purposes. ‘auto-mpg’ is one such open dataset. 

It contains data related to fuel consumption by 
automobiles in a city. Consumption is measured in 
miles per gallon (mpg), hence the name of the dataset 
is auto-mpg. The data has 398 rows (also known as 
items or instances or objects) and nine columns 
(also Known as attributes). 


The attributes are: mpg, cylinders, displacement, 
horsepower, weight, acceleration, model year, origin, 
car name. Three attributes, cylinders, model year 
and origin have categorical values, car name is a 
string with a unique value for every row, while the 
remaining five attributes have numeric value. 

The data has been downloaded from the UCI data 
repository available at http://archive.ics.uci.edu/ 
ml/machine-learning-databases/auto-mpg/. 


Following are the exercises to analyse the data. 
1) Load auto-mpg.data into a DataFrame autodf. 


2) Give description of the generated DataFrame 
autodf. 


3) Display the first 10 rows of the DataFrame 
autodf. 


4) Find the attributes which have missing values. 
Handle the missing values using following two 
ways: 


i. Replace the missing values by a value before 
that. 


ii. Remove the rows having missing values from 
the original dataset 


5) Print the details of the car which gave the 
maximum mileage. 


6) Find the average displacement of the car given 
the number of cylinders. 


7) Whatis the average number of cylinders in a car? 


8) Determine the no. of cars with weight greater 
than the average weight. 





